The main function in our code is RL_Algorithm, which is called with the parameters (S, A, P, R, r, H, N, t):
    where S is the number of states, and the state space is {0, 1, ..., S-1};
    A is the number of actions, and the action space is {0, 1, ..., A-1};
    P is the transition matrix, an A * S * S array;
    R is the reward function, an S * A array;
    r is the discount factor;
    H is the horizon (number of steps);
    N is the number of iterations of our algorithm;
    t is the order of lambda with respect to n.
With these parameters, the function returns the value and the policy, under the assumption that the initial state is 0.
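As a rough sketch of how the inputs can be assembled (the shapes follow the parameter description above; the random construction here is only an illustration, and `RL_Algorithm` is the function from this repository, shown as a commented call):

```python
import numpy as np

S, A = 5, 3          # number of states and actions
rng = np.random.default_rng(0)

# P: A x S x S transition array; each row P[a, s, :] must be a probability vector
P = rng.random((A, S, S))
P /= P.sum(axis=2, keepdims=True)

# R: S x A reward array
R = rng.random((S, A))

# Sanity checks matching the description above
assert P.shape == (A, S, S) and np.allclose(P.sum(axis=2), 1.0)
assert R.shape == (S, A)

# Intended call, with illustrative values for r, H, N, t:
# value, policy = RL_Algorithm(S, A, P, R, r=0.9, H=20, N=1000, t=0.5)
```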

Based on these definitions, we study respectively the asymptotic behavior of the average value, the approximation of the regret, and the comparison of different lambda values.
The first two are explained in the article, while the third tries different orders of lambda to find the optimal one.

Remark: Our P and R come from the function mdptoolbox.example.rand(), and we slightly adjust the returned reward function to obtain a more realistic one. It is also normal to see different results between runs because of randomness.
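One hypothetical form such a reward adjustment could take is rescaling the raw rewards into [0, 1]; this is not necessarily the exact transformation used in the code, and the function name below is illustrative:

```python
import numpy as np

def adjust_reward(R):
    """Rescale a raw reward array into [0, 1] so rewards are comparable
    across random instances. Illustrative only; the repository may apply
    a different adjustment."""
    R = np.asarray(R, dtype=float)
    lo, hi = R.min(), R.max()
    if hi == lo:                     # constant reward: nothing to rescale
        return np.zeros_like(R)
    return (R - lo) / (hi - lo)

raw = np.array([[ 2.0, -1.0],
                [ 0.5,  3.0]])
adjusted = adjust_reward(raw)
# adjusted now lies in [0, 1], with the smallest raw reward mapped to 0
# and the largest mapped to 1
```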

